73 research outputs found

Controlling for Unobserved Confounds in Classification Using Correlational Constraints

Author: Culotta Aron
Landeiro Virgile
Publication date: 03/05/2017

As statistical classifiers become integrated into real-world applications, it is important to consider not only their accuracy but also their robustness to changes in the data distribution. In this paper, we consider the case where there is an unobserved confounding variable $z$ that influences both the features $\mathbf{x}$ and the class variable $y$. When the influence of $z$ changes from training to testing data, we find that the classifier accuracy can degrade rapidly. In our approach, we assume that we can predict the value of $z$ at training time with some error. The prediction for $z$ is then fed to Pearl's back-door adjustment to build our model. Because of the attenuation bias caused by measurement error in $z$, standard approaches to controlling for $z$ are ineffective. In response, we propose a method to properly control for the influence of $z$ by first estimating its relationship with the class variable $y$, then updating predictions for $z$ to match that estimated relationship. By adjusting the influence of $z$, we show that we can build a model that exceeds competing baselines on accuracy as well as on robustness over a range of confounding relationships. Comment: 9 page
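
The back-door adjustment named in this abstract can be illustrated with a small numeric sketch. Assuming a discrete confounder z and a model that already provides p(y | x, z), the adjusted prediction is p(y | do(x)) = sum over z of p(y | x, z) p(z), which weights by the marginal p(z) rather than p(z | x). The function name and toy probabilities below are hypothetical and are not taken from the paper.

```python
import numpy as np

def backdoor_adjust(p_y_given_xz, p_z):
    """Return p(y | do(x)) = sum_z p(y | x, z) * p(z) for one fixed x.

    p_y_given_xz: array of shape (n_z, n_y); row k is p(y | x, z=k).
    p_z: array of shape (n_z,), the marginal distribution of z.
    """
    p_y_given_xz = np.asarray(p_y_given_xz, dtype=float)
    p_z = np.asarray(p_z, dtype=float)
    # Weight each z-specific prediction by the marginal p(z), not p(z | x).
    return p_z @ p_y_given_xz

# Toy values (hypothetical): two values of z, two classes of y.
toy_p_y_given_xz = [[0.9, 0.1], [0.4, 0.6]]
toy_p_z = [0.5, 0.5]
adjusted = backdoor_adjust(toy_p_y_given_xz, toy_p_z)
```

Because the weights come from p(z) alone, the adjusted prediction does not change when the correlation between z and x shifts between training and testing.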

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Co-training for Demographic Classification Using Deep Learning from Label Proportions

Author: Ardehaly Ehsan Mohammady
Culotta Aron
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 12/09/2017

Deep learning algorithms have recently produced state-of-the-art accuracy in many classification tasks, but this success typically depends on access to many annotated training examples. For domains without such data, an attractive alternative is to train models with light, or distant, supervision. In this paper, we introduce a deep neural network for the Learning from Label Proportions (LLP) setting, in which the training data consist of bags of unlabeled instances with an associated label distribution for each bag. We introduce a new regularization layer, Batch Averager, that can be appended to the last layer of any deep neural network to convert it from supervised learning to LLP. This layer can be implemented readily with existing deep learning packages. To further support domains in which the data consist of two conditionally independent feature views (e.g., image and text), we propose a co-training algorithm that iteratively generates pseudo bags and refits the deep LLP model to improve classification accuracy. We demonstrate our models on demographic attribute classification (gender and race/ethnicity), which has many applications in social media analysis, public health, and marketing. We conduct experiments to predict demographics of Twitter users based on their tweets and profile image, without requiring any user-level annotations for training. We find that the deep LLP approach outperforms baselines for both text and image features separately. Additionally, we find that the co-training algorithm improves image and text classification by 4% and 8% absolute F1, respectively. Finally, an ensemble of text and image classifiers further improves the absolute F1 measure by 4% on average.
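
The bag-level supervision described in this abstract can be sketched as follows: average the per-instance predicted class probabilities within a bag, then penalize the divergence between that average and the bag's known label distribution. This is a minimal NumPy sketch under assumptions; the KL-divergence loss and the function names are illustrative choices, and the paper's actual Batch Averager layer may differ.

```python
import numpy as np

def bag_average(probs):
    """Average per-instance class probabilities over one bag.

    probs: array of shape (n_instances, n_classes).
    """
    return np.asarray(probs, dtype=float).mean(axis=0)

def llp_bag_loss(probs, target, eps=1e-12):
    """KL(target || bag average); an assumed loss, not the paper's definition."""
    avg = bag_average(probs)
    target = np.asarray(target, dtype=float)
    # eps guards against log(0) when a probability is exactly zero.
    return float(np.sum(target * (np.log(target + eps) - np.log(avg + eps))))

# Toy bag (hypothetical): two instances, two classes.
toy_probs = [[0.8, 0.2], [0.4, 0.6]]
loss_match = llp_bag_loss(toy_probs, [0.6, 0.4])     # target equals the bag average
loss_mismatch = llp_bag_loss(toy_probs, [0.5, 0.5])  # target differs from it
```

The loss depends only on the bag average, so no instance-level label is needed to compute it.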

arXiv.org e-Print Archive

Crossref

Mining the Demographics of Political Sentiment from Twitter Using Learning from Label Proportions

Author: Ardehaly Ehsan Mohammady
Culotta Aron
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 26/08/2017

Opinion mining and demographic attribute inference have many applications in social science. In this paper, we propose models to infer daily joint probabilities of multiple latent attributes from Twitter data, such as political sentiment and demographic attributes. Since it is costly and time-consuming to annotate data for traditional supervised classification, we instead propose scalable Learning from Label Proportions (LLP) models for demographic and opinion inference using U.S. Census data, national and state political polls, and the Cook Partisan Voting Index as population-level data. In LLP classification settings, the training data are divided into a set of unlabeled bags, where only the label distribution of each bag is known, removing the requirement of instance-level annotations. Our proposed LLP model, Weighted Label Regularization (WLR), provides a scalable generalization of prior work on label regularization to support weights for samples inside bags, which is applicable in this setting where bags are arranged hierarchically (e.g., county-level bags are nested inside state-level bags). We apply our model to Twitter data collected in the year leading up to the 2016 U.S. presidential election, producing estimates of the relationships among political sentiment and demographics over time and place. We find that our approach closely tracks traditional polling data stratified by demographic category, resulting in error reductions of 28-44% over baseline approaches. We also provide descriptive evaluations showing how the model may be used to estimate interactions among many variables and to identify linguistic temporal variation, capabilities which are typically not feasible using traditional polling methods.
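
The role of per-sample weights inside a bag can be illustrated with a weighted average of instance predictions. This is only a sketch under assumptions: the function name, the weights, and the toy values are hypothetical, and the abstract does not specify how WLR defines or uses its weights.

```python
import numpy as np

def weighted_bag_proportions(probs, weights):
    """Weighted average of per-instance class probabilities for one bag.

    probs: array of shape (n_instances, n_classes).
    weights: array of shape (n_instances,), non-negative sample weights.
    """
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Each instance contributes in proportion to its weight.
    return (weights[:, None] * probs).sum(axis=0) / weights.sum()

# Toy bag (hypothetical): the first instance counts three times as much.
toy_estimate = weighted_bag_proportions([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])
```

The resulting bag-level estimate could then be compared with a known population-level distribution, such as a poll result for that bag.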

arXiv.org e-Print Archive

Crossref

Identifying leading indicators of product recalls from online reviews using positive unlabeled learning and domain adaptation

Author: Bhat Shreesh Kumara
Culotta Aron
Publication date: 01/03/2017

Consumer protection agencies are charged with safeguarding the public from hazardous products, but the thousands of products under their jurisdiction make it challenging to identify and respond to consumer complaints quickly. From the consumer's perspective, online reviews can provide evidence of product defects, but manually sifting through hundreds of reviews is not always feasible. In this paper, we propose a system to mine Amazon.com reviews to identify products that may pose safety or health hazards. Since labeled data for this task are scarce, our approach combines positive unlabeled learning with domain adaptation to train a classifier from consumer complaints submitted to the U.S. Consumer Product Safety Commission. On a validation set of manually annotated Amazon product reviews, we find that our approach results in an absolute F1 score improvement of 8% over the best competing baseline. Furthermore, we apply the classifier to Amazon reviews of known recalled products; the classifier identifies reviews reporting safety hazards prior to the recall date for 45% of the products. This suggests that the system may be able to serve as an early warning system that alerts consumers to hazardous products before an official recall is announced.
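
The abstract does not say which positive unlabeled (PU) learning method is used. As one well-known illustration, the Elkan and Noto (2008) correction rescales the scores of a classifier trained to separate labeled positives from unlabeled examples: with c estimated as the mean score on held-out labeled positives, p(y=1 | x) is approximated by score(x) / c. The sketch below assumes the scores are already available; the function name and the toy values are hypothetical.

```python
import numpy as np

def elkan_noto_adjust(scores_unlabeled, scores_labeled_pos):
    """Rescale labeled-vs-unlabeled scores into estimates of p(y=1 | x).

    scores_unlabeled: scores g(x) = p(labeled | x) for the examples to adjust.
    scores_labeled_pos: scores g(x) on held-out labeled positive examples.
    """
    # c estimates p(labeled | y=1), the fraction of positives that are labeled.
    c = float(np.mean(scores_labeled_pos))
    adjusted = np.asarray(scores_unlabeled, dtype=float) / c
    # Clip because the ratio can exceed 1 when c is underestimated.
    return np.clip(adjusted, 0.0, 1.0)

# Toy scores (hypothetical).
toy_adjusted = elkan_noto_adjust([0.1, 0.25, 0.6], [0.5, 0.5])
```

This correction relies on the assumption that labeled positives are a random sample of all positives, which may not hold when the labeled complaints and the unlabeled reviews come from different domains.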

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Finding Truth in Cause-Related Advertising: A Lexical Analysis of Brands’ Health, Environment, and Social Justice Communications on Twitter

Author: Culotta Aron
Cutler Jennifer
Zheng Junzhe
Publication venue: ValpoScholar
Publication date: 06/07/2015

Consumers increasingly desire to make purchasing decisions based on factors such as health, the environment, and social justice. In response, there has been a commensurate rise in cause-related marketing to appeal to socially conscious consumers. However, a lack of regulation and standardization makes it difficult for consumers to assess marketing claims; this is further complicated by social media, which firms use to cultivate a personality for their brand through frequent conversational messages. Yet little empirical research has explored the relationship between cause-related marketing messages on social media and the true cause alignment of brands. In this paper, we explore this relationship by pairing the marketing messages from the Twitter accounts of over 1,000 brands with third-party ratings of each brand with respect to health, the environment, and social justice. Specifically, we perform text regression to predict each brand’s true rating in each dimension based on the lexical content of its tweets, and find significant held-out correlation on each task, suggesting that a brand’s alignment with a social cause can be somewhat reliably signaled through its Twitter communications, though the signal is weak in many cases. To aid in the identification of brands that engage in misleading cause-related communication, as well as terms that more likely indicate insincerity, we propose a procedure to rank both brands and terms by their volume of “conflicting” communications (i.e., “greenwashing”). We further explore how cause-related terms are used differently by brands that are strong vs. weak in actual alignment with the cause. The results provide insight into current practices in cause-related marketing in social media, and provide a framework for identifying and monitoring misleading communications. Together, they can be used to promote transparency in cause-related marketing in social media, better enabling brands to communicate authentic values-based policy decisions, and consumers to make socially responsible purchase decisions.

Valparaiso University